Analysis of the World Happiness Report Dataset
PSTAT100 Final Project - Dec. 2023

Description of the Dataset - World Happiness Report Dataset¶

In [16]:
Image(filename = 'picture/1.0.1.jpg', width=800, height=400)
Out[16]:

Describe the Observation:

The World Happiness Report dataset, spanning 2008 to 2022, is a tidy collection structured for analyzing global happiness trends. It covers indicators of economic status, social dynamics, health, freedom, generosity, and perceptions of corruption, as well as measures of positive and negative emotions. Each row represents a unique combination of country and year, which makes the dataset well suited to longitudinal studies of how these factors relate to overall well-being and happiness across nations over the 14-year span. The specific variables are listed in the table below.

| Variable Name | Description | Type |
| --- | --- | --- |
| Country name | Name of the country | Categorical |
| year | Year in which the data was recorded | Categorical |
| Life Ladder | Overall happiness score, a measure of well-being | Numeric |
| Log GDP per capita | Logged Gross Domestic Product per capita, indicating economic status | Numeric |
| Social support | Degree of social support, based on social networks and individual's feeling of support | Numeric |
| Healthy life expectancy at birth | Expected number of years of healthy life at birth | Numeric |
| Freedom to make life choices | Degree of personal freedom in life choices | Numeric |
| Generosity | Level of generosity, often based on charitable giving assessments | Numeric |
| Perceptions of corruption | Perceptions of corruption, often based on the level of corruption in public and private sectors | Numeric |
| Positive affect | Frequency or intensity of experiencing positive emotions or affect | Numeric |
| Negative affect | Frequency or intensity of experiencing negative emotions or affect | Numeric |
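
The claim that each row is a unique country-year combination can be checked directly. The snippet below is a minimal sketch on a made-up miniature of the dataset; the helper name `duplicated_country_years` and the toy values are assumptions for illustration, not part of the project code.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B", "B"],
    "year": [2008, 2009, 2008, 2009, 2009],
    "Life Ladder": [5.1, 5.3, 6.0, 6.2, 6.2],
})

def duplicated_country_years(df, keys=("Country name", "year")):
    """Return every row whose (country, year) key appears more than once."""
    mask = df.duplicated(subset=list(keys), keep=False)
    return df[mask]

dupes = duplicated_country_years(toy)
print(dupes)
```

An empty result on the real data would support the tidy-structure description; any returned rows would need to be resolved before a longitudinal analysis.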

Research Questions¶

Correlation Between Factors and Happiness Scores:¶

This analysis measures the correlation between a country's happiness score ('Life Ladder') and key socio-economic factors such as 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', and 'Perceptions of corruption', to understand how strongly each factor is associated with national happiness levels.

In [26]:
# Correlation matrix
Image(filename = 'picture/2.1.1.jpg', width=500, height=500)
Out[26]:
In [20]:
# Summary
Image(filename = 'picture/2.1.2.jpg', width=600, height=400)
Out[20]:

Describe the Observation (2.1):

The heatmap illustrates the correlation matrix of happiness scores and various socio-economic factors. It indicates that 'Life Ladder' (happiness score) has strong positive correlations with 'Log GDP per capita' (0.78), 'Social support' (0.72), and 'Healthy life expectancy at birth' (0.71), suggesting that higher economic status, better social networks, and longer healthy lifespans are associated with increased happiness. 'Freedom to make life choices' also shows a moderately positive correlation (0.53) with happiness scores. Interestingly, 'Generosity' has a very weak positive correlation (0.18) with happiness, indicating that it may not be as strongly linked to happiness as the other factors. Notably, 'Perceptions of corruption' are negatively correlated (-0.43) with happiness scores, suggesting that higher perceived corruption within a country is associated with lower happiness levels among its citizens. This heatmap provides a clear visual representation of how each factor is related to happiness, highlighting the complex interplay of economic, social, and political influences on well-being.
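
Reading the strongest relationships off a heatmap can be complemented by ranking the correlations with the target column. The following is a minimal sketch on simulated data; the function name `rank_correlations` and the simulated values are assumptions, and the numbers it prints are unrelated to the figures reported above.

```python
import numpy as np
import pandas as pd

def rank_correlations(df, target):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Simulated data (illustrative only): one strong driver, one negative, one unrelated.
rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(size=n)
toy = pd.DataFrame({
    "Life Ladder": gdp + 0.5 * rng.normal(size=n),
    "Log GDP per capita": gdp,
    "Generosity": rng.normal(size=n),
    "Perceptions of corruption": -gdp + 2.0 * rng.normal(size=n),
})

ranked = rank_correlations(toy, "Life Ladder")
print(ranked)
```

Sorting by absolute value keeps strongly negative factors, such as perceived corruption, near the top instead of at the bottom of the list.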

Happiness Trends Over Time in Different Regions:¶

The project will explore how happiness scores have evolved over time, particularly over the last 14 years. It will identify regions with the most significant positive or negative changes in happiness scores. Additionally, the study will use a choropleth map to visualize these changes and compare them across countries.

In [22]:
# Choropleth map
Image(filename = 'picture/2.2.1.jpg', width=800, height=400)
Out[22]:

Describe the Observation (2.2):

The choropleth map presents a striking depiction of changes in global happiness, with Afghanistan's deep purple hue standing out, signaling a substantial decline in happiness scores. In contrast, China and Russia exhibit shades of orange, indicating an increase in happiness. Notably, several countries in North America are colored in shades of pink, reflecting a decline in happiness scores. The map also reveals that the most widespread declines are observed across African and Middle Eastern nations, while South America displays a mix of countries with both increases and decreases in happiness.
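
The quantity colored on the map is the change in 'Life Ladder' between a country's first and last available year. A minimal sketch of that computation follows, assuming the rows may arrive in any order; the helper name `ladder_change` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, deliberately out of chronological order.
toy = pd.DataFrame({
    "Country name": ["A", "A", "A", "B", "B"],
    "year": [2015, 2008, 2022, 2010, 2019],
    "Life Ladder": [5.0, 4.0, 5.5, 6.0, 5.2],
})

def ladder_change(df):
    """Last observed 'Life Ladder' minus first observed, per country."""
    ordered = (df.dropna(subset=["Life Ladder"])
                 .sort_values(["Country name", "year"]))
    grouped = ordered.groupby("Country name")["Life Ladder"]
    return (grouped.last() - grouped.first()).rename("Ladder Difference")

change = ladder_change(toy)
print(change)
```

Sorting by year before taking the first and last values matters: without it, country A's change would be computed from the 2015 and 2022 rows rather than from 2008 and 2022.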

Differences in Happiness Scores Across Economic Blocs:¶

A comparative analysis will be conducted to examine if significant differences exist in happiness scores and other variables between economic blocs, particularly between OECD and non-OECD countries.

In [24]:
# Bar plot
Image(filename = 'picture/2.3.1.jpg', width=800, height=400)
Out[24]:

Describe the Observation (2.3):

The bar plot indicates that OECD countries consistently outperform their non-OECD counterparts across several key metrics of societal well-being. For instance, OECD nations demonstrate higher average scores in the 'Life Ladder', reflecting greater overall life satisfaction. Additionally, economic and social indicators such as 'Log GDP per capita' and 'Social support' show higher means in OECD countries, suggesting a correlation between economic prosperity and the strength of social networks within these nations. A particularly stark difference is observed in 'Healthy life expectancy at birth', where the OECD mean exceeds the non-OECD mean by approximately 1.3 standardized units (the plotted values are z-scores), highlighting substantial disparities in health and welfare conditions.

Furthermore, OECD countries exhibit higher averages in 'Freedom to make life choices', which could point to greater personal liberties and autonomy. OECD nations also appear to have higher levels of 'Generosity' and 'Positive affect', along with lower average scores in 'Perceptions of corruption' and 'Negative affect', suggesting lower perceived corruption and fewer experiences of negative emotions. An intriguing observation in the "Ladder Difference" metric is that non-OECD countries show a positive average change, possibly indicating progress in their development, while OECD countries show a negative average change, potentially influenced by economic fluctuations. Overall, the data suggests that OECD countries enjoy a higher standard of well-being across economic, social support, health, and freedom dimensions, while non-OECD countries face greater challenges in these areas.
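
The comparison above rests on group means. One conventional way to ask whether such a gap is statistically significant is Welch's t-test, which does not assume equal variances. This test is not part of the project code; the sketch below applies it to simulated per-country mean scores (group sizes and values are assumptions), using per-country means because repeated yearly rows for one country are not independent observations.

```python
import numpy as np
from scipy import stats

def welch_test(group_a, group_b):
    """Two-sided Welch t-test; returns (t statistic, p-value)."""
    result = stats.ttest_ind(group_a, group_b, equal_var=False)
    return result.statistic, result.pvalue

# Simulated per-country mean 'Life Ladder' scores (illustrative values only).
rng = np.random.default_rng(0)
oecd = rng.normal(loc=6.6, scale=0.6, size=38)
non_oecd = rng.normal(loc=5.2, scale=0.9, size=100)

t_stat, p_value = welch_test(oecd, non_oecd)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

On the real data, the two input arrays would be the per-country means of a metric for each group, with missing values dropped first.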

Predictive Analysis of Happiness Scores:¶

The project will explore the feasibility of predicting a country's happiness score based on various socio-economic indicators. This will involve employing statistical models to see whether happiness ("Life Ladder") can be forecasted from other measurable socio-economic factors.

In [31]:
# Regression model analysis: Linear vs. Random Forest
Image(filename = 'picture/2.4.1.jpg', width=700, height=300)
Out[31]:
In [32]:
# Actual vs Predicted Happiness Scores using Random Forest
Image(filename = 'picture/2.4.2.jpg', width=800, height=400)
Out[32]:

Describe the Observation (2.4):

The Random Forest Regression model, with a Mean Squared Error (MSE) of 0.177016 and a Coefficient of Determination (R²) of 0.855188 on the test set, stands out as the more robust model for forecasting a country's happiness score when compared to Linear Regression. The Random Forest model's ability to capture nonlinear relationships and interactions among the predictors allows for predictions that are closer to the observed scores, as evidenced by the lower MSE. Furthermore, the higher R² value confirms the model's capacity to account for a greater proportion of the variance in happiness scores. Additionally, in the 'Actual vs. Predicted Happiness Scores - Random Forest' scatter plot, the points are concentrated near the diagonal line, further demonstrating the model's accuracy in predicting happiness scores.
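
The scores above come from a single train/test split. A common way to check that a model ranking does not depend on one particular split is K-fold cross-validation. The sketch below is an illustration on simulated, deliberately nonlinear data; the helper name `mean_cv_r2`, the data, and the settings are assumptions and do not reproduce the reported numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated predictors and a deliberately nonlinear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

def mean_cv_r2(model, X, y, n_splits=5):
    """Average R^2 over shuffled K-fold splits."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

r2_linear = mean_cv_r2(LinearRegression(), X, y)
r2_forest = mean_cv_r2(RandomForestRegressor(n_estimators=50, random_state=0), X, y)
print(f"Linear R^2: {r2_linear:.3f}, Random Forest R^2: {r2_forest:.3f}")
```

Applied to the project data, the same helper would take the predictor matrix and 'Life Ladder' column in place of the simulated arrays.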

The feature importance graph underscores 'Log GDP per capita' as the most influential factor, significantly overshadowing the contributions of other socio-economic factors like 'Healthy life expectancy at birth', 'Social support', 'Freedom to make life choices', 'Generosity', and 'Perceptions of corruption'. This hierarchy of features highlights the substantial role economic status plays in determining happiness, followed by health and social cohesion.
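
Impurity-based importances from a Random Forest are computed on the training data, so a complementary check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. The following is a minimal sketch on simulated data with hypothetical feature names; it is an additional technique for comparison, not the method used to produce the graph described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Simulated data (illustrative only): one strong, one weak, one irrelevant feature.
rng = np.random.default_rng(0)
feature_names = ["strong", "weak", "irrelevant"]  # hypothetical names
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the mean drop in R^2.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0)
ranking = [feature_names[i] for i in np.argsort(result.importances_mean)[::-1]]
print(ranking)
```

If both importance measures agree on the leading feature, the conclusion about its dominance is less likely to be an artifact of how the forest was trained.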

Cluster Analysis for Identifying Similar Happiness Traits:¶

Lastly, the study will investigate whether there are identifiable clusters of countries with similar happiness characteristics. This analysis will help in understanding if countries with similar socio-economic profiles also share similar levels of happiness.

In [34]:
# Cluster analysis output (part 1)
Image(filename = 'picture/2.5.1.jpg', width=600, height=400)
Out[34]:
In [37]:
# Cluster analysis output (part 2)
Image(filename = 'picture/2.5.2.jpg', width=1000, height=400)
Out[37]:

Describe the Observation (2.5):

The clustering analysis of countries based on various socio-economic indicators reveals a profound segmentation that aligns closely with global economic and development trends. Cluster 2, encapsulating what are typically regarded as developed nations, including Australia, Austria, and Belgium, stands out with high income, high GDP per capita, and exceptionally high living standards (as indicated by the boxplot). These countries benefit from robust healthcare and education systems, contributing to their high rankings in 'Life Ladder,' 'Log GDP per capita,' 'Social support,' and 'Healthy life expectancy at birth.'

In stark contrast, Cluster 3 countries, such as Afghanistan, Benin, and Burkina Faso, represent the lower end of the global economic spectrum. Predominantly low-income, they grapple with numerous challenges, including economic instability, lower standards of living, and constrained access to healthcare and education. This is reflected in their lower scores across key well-being indicators, placing them at the opposite end of the spectrum compared to Cluster 2. The clear demarcation in socio-economic and well-being metrics between these clusters is indicative of global inequality in wealth distribution and access to essential services.

Clusters 0 and 1, comprising countries like Albania, Algeria, and Angola (Cluster 0), and a more diverse group including Argentina, Bahrain, and Belize (Cluster 1), occupy the middle ground in this global partition. Cluster 0 encompasses middle-income or transitional nations that show reasonable development in certain areas but still face numerous challenges. Cluster 1, with its mixed bag of middle to high-income economies, displays varying degrees of development and well-being. The variance in 'Freedom to make life choices,' 'Generosity,' 'Perceptions of corruption,' and affective states across these clusters further underscores the diversity within these categories. This clustering not only highlights the multifaceted nature of global development and well-being but also underscores the nuanced differences within groups that go beyond mere economic classifications.
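
The cluster numbers used in this discussion are assigned arbitrarily by k-means and can change from one run to the next. One way to make the numbering stable is to renumber clusters by the mean of a reference feature. The sketch below does this on simulated data; the helper name `relabel_by_feature_mean` and the data are assumptions, and on the project data the reference feature could be, for example, 'Life Ladder'.

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_by_feature_mean(labels, reference):
    """Renumber clusters so that label 0 has the lowest mean of `reference`."""
    labels = np.asarray(labels)
    reference = np.asarray(reference, dtype=float)
    old_ids = np.unique(labels)
    means = np.array([reference[labels == c].mean() for c in old_ids])
    order = old_ids[np.argsort(means)]  # old ids, lowest mean first
    mapping = {int(old): new for new, old in enumerate(order)}
    return np.array([mapping[int(c)] for c in labels])

# Simulated data with three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
stable = relabel_by_feature_mean(raw, X[:, 0])
print(np.bincount(stable))
```

With labels ordered this way, statements such as "the highest-numbered cluster contains the highest-scoring countries" remain valid across reruns.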

Conclusion¶

Recall Important Observations:

The research conducted through a detailed analysis of the World Happiness Report dataset has led to significant insights into the determinants of national well-being. The dataset, encompassing a plethora of socio-economic indicators from 2008 to 2022, facilitated a comprehensive understanding of how factors such as economic status, social support, health, and freedom influence happiness across nations. The project identified clear trends and disparities: developed nations with robust economies and social structures consistently reported higher happiness scores, while countries grappling with economic and social challenges exhibited lower levels of well-being.

Through predictive analysis, the Random Forest Regression model emerged as a robust tool, outshining Linear Regression with a lower MSE and higher R², indicating greater accuracy and predictive power. The model highlighted the preeminent role of 'Log GDP per capita' in predicting happiness, suggesting a strong linkage between economic prosperity and overall life satisfaction. This finding underscores the criticality of economic development as a cornerstone for enhancing national well-being. Cluster analysis further unraveled patterns of happiness traits among countries, revealing groupings that resonate with global economic classifications and development levels. It delineated a spectrum of well-being, from affluent and stable countries to those facing significant socio-economic challenges. These clusters paint a vivid picture of the global landscape of happiness, emphasizing the multifaceted nature of development and the interplay of various factors influencing happiness.

Code¶

In [ ]:
# 1.1 Code
# Import Packages
import numpy as np
import pandas as pd
import altair as alt
import seaborn as sns
import plotly.express as px
import statsmodels.api as sm
import matplotlib.pyplot as plt

from sklearn import preprocessing
from IPython.display import Image # Insert pictures
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cdist #calculate distance

# disable row limit for plotting
alt.data_transformers.disable_max_rows()

# View the WHR Dataset
whr = pd.read_csv('data/whr-2023.csv')
whr.head()
In [ ]:
# Code of 2.1
# Selecting relevant columns for the correlation analysis
columns_of_interest = ['Life Ladder', 'Log GDP per capita', 'Social support', 
                       'Healthy life expectancy at birth', 
                       'Freedom to make life choices', 'Positive affect','Generosity', 
                       'Negative affect','Perceptions of corruption']
whr_subset = whr[columns_of_interest]

# Calculating the correlation matrix
corr_matrix = whr_subset.corr()

# Plotting the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Happiness Score and Socio-Economic Factors')
plt.show()


# Selecting independent variables
columns_of_interest2 = ['Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 
                        'Freedom to make life choices', 'Generosity', 'Perceptions of corruption',
                        'Positive affect','Negative affect']

# Handling missing and infinite values
whr_na = whr.replace([np.inf, -np.inf], np.nan)  
whr_na = whr_na.dropna(subset=columns_of_interest2 + ['Life Ladder']) 

X = whr_na[columns_of_interest2]
X = sm.add_constant(X)  # Adding a constant to the model

Y = whr_na['Life Ladder']

# Fitting the model
model = sm.OLS(Y, X).fit()

# Printing the summary
print(model.summary())
In [ ]:
# Code of 2.2
# Load the iso datasets
country_codes = pd.read_csv('data/iso-country-codes.csv')

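# Find the 'Life Ladder' for the first and last available year
# (sort by year first so that first()/last() follow chronological order)
whr_sorted = whr.sort_values(['Country name', 'year'])
whr_first = whr_sorted.groupby('Country name').first().reset_index()
whr_last = whr_sorted.groupby('Country name').last().reset_index()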

# Calculate the difference
whr_diff = pd.DataFrame()
whr_diff['Country name'] = whr_last['Country name']
whr_diff['Ladder Difference'] = whr_last['Life Ladder'] - whr_first['Life Ladder']

# Merge the country codes
merged_data = whr_diff.merge(country_codes, how='left', left_on='Country name', 
                             right_on='English short name lower case')

# Create the choropleth map
choropleth_map = px.choropleth(merged_data,
                    locations='Alpha-3 code',  
                    color='Ladder Difference',  
                    hover_name='Country name',  
                    color_continuous_scale=px.colors.sequential.Plasma, 
                    title='Change in Global Happiness Scores over Available Years')

# Show
choropleth_map.show("notebook")
In [ ]:
# Code of 2.3
# List of OECD member countries without abbreviations
oecd_member_countries = [
    'Australia', 'Austria', 'Belgium', 'Canada',
    'Chile', 'Colombia', 'Costa Rica', 'Czechia',
    'Denmark', 'Estonia', 'Finland', 'France',
    'Germany', 'Greece', 'Hungary', 'Iceland',
    'Ireland', 'Israel', 'Italy', 'Japan',
    'South Korea', 'Latvia', 'Lithuania', 'Luxembourg',
    'Mexico', 'Netherlands', 'New Zealand', 'Norway',
    'Poland', 'Portugal', 'Slovakia', 'Slovenia',
    'Spain', 'Sweden', 'Switzerland', 'Turkiye',
    'United Kingdom', 'United States'
]

all_countries = whr['Country name'].unique()  # Get a list of countries
oecd_membership = {country: 1 if country in oecd_member_countries   # Create a dictionary
                   else 0 for country in all_countries}
oecd_df = pd.DataFrame(list(oecd_membership.items()), columns=['Country name', 'OECD'])

# Merge the OECD with the WHR and difference datasets
whr_oecd = whr.merge(oecd_df, on='Country name', how='left')
whr_oecd = whr_oecd.merge(whr_diff, on='Country name', how='left')

# Group by OECD & calculate mean values
numeric_columns = whr_oecd.select_dtypes(include=['float64', 'int64']) # Selecting numeric cols
mean_values = numeric_columns.groupby(whr_oecd['OECD']).mean() 
mean_values_reduced = mean_values.drop(columns=['year'])  # Remove column

# Check if 'year' is in the DataFrame and then drop it if it is
numeric_columns = whr_oecd.select_dtypes(include=['float64', 'int64'])
if 'year' in numeric_columns:
    numeric_columns = numeric_columns.drop(columns='year')

# If 'OECD' is part of the numeric columns, drop it as well since we don't want to standardize it
if 'OECD' in numeric_columns:
    numeric_columns = numeric_columns.drop(columns='OECD')

# Standardize the numeric columns
scaler = preprocessing.StandardScaler()
standardized_values = scaler.fit_transform(numeric_columns)

# Create a DataFrame of the standardized values with original indices and column names
standardized_df = pd.DataFrame(standardized_values, index=numeric_columns.index, columns=numeric_columns.columns)

# Add back the 'OECD' column to the standardized DataFrame
standardized_df['OECD'] = whr_oecd['OECD']

# Melt the DataFrame for visualization
melted_standardized_df = standardized_df.melt(id_vars='OECD', var_name='Metric', value_name='Mean Value')

# Set style of the plot 
sns.set(style="whitegrid")

# Plotting
plt.figure(figsize=(10, 6))
# errorbar=None (seaborn >= 0.12) replaces the deprecated ci=None
barplot = sns.barplot(x='Metric', y='Mean Value', hue='OECD', data=melted_standardized_df, errorbar=None)
plt.title('Comparison of Standardized Mean Happiness Metrics by OECD Status')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Mean Value')
plt.xlabel('Metric')

# Show
plt.tight_layout()
plt.show()
In [ ]:
# Code of 2.4
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Define predictor variables and target variable
X = whr_na[['Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 
         'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']]
y = whr_na['Life Ladder']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model instance
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
y_pred = linear_regressor.predict(X_test) # Make predictions
mse_lr = mean_squared_error(y_test, y_pred) # Evaluate the model
r2_lr = r2_score(y_test, y_pred)
# Print the model's coefficients and scores
print('For Linear Regression Model:')
print(f'Coefficients: {linear_regressor.coef_}')
print(f'Mean squared error: {mse_lr}')
print(f'Coefficient of determination: {r2_lr} \n')


# Create a Random Forest model instance
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_reg.fit(X_train, y_train)
y_pred_rf = random_forest_reg.predict(X_test) # Make predictions
mse_rf = mean_squared_error(y_test, y_pred_rf) # Evaluate the model
r2_rf = r2_score(y_test, y_pred_rf)
# Print the model's scores
print('For Random Forest Model:')
print(f'Mean squared error: {mse_rf}')
print(f'Coefficient of determination: {r2_rf}')


# Actual vs Predicted Happiness Scores using Random Forest
plt.figure(figsize=(8, 5))
plt.scatter(y_test, y_pred_rf, alpha=0.4)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # Diagonal line
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted Happiness Scores - Random Forest')
plt.show()


# Get feature importance
feature_importance = random_forest_reg.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, # Create a DataFrame
                              'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False) # Sort in descending order

# Visualizing
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importances in Predicting Happiness Score')
plt.gca().invert_yaxis()
plt.show()
In [ ]:
# Code of 2.5
numeric_columns2 = whr.select_dtypes(include=['float64', 'int64']) # Selecting only numeric columns
mean_values_counrty = numeric_columns2.groupby(whr['Country name']).mean() # Group by country & calculate mean
mean_values_counrty2=mean_values_counrty.dropna().reset_index()
data1=mean_values_counrty2.drop(['year','Country name'], axis=1)
# data1.head()

# Standardization
z_scaler = preprocessing.StandardScaler()
data_z = z_scaler.fit_transform(data1)
data_z = pd.DataFrame(data_z)

# Normalization
minmax_scale = preprocessing.MinMaxScaler().fit(data_z)
data2 = minmax_scale.transform(data_z)
pd.DataFrame(data2).head()

# Calculate the mean distortions for different values of k
K = range(1,11)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed keeps the elbow curve reproducible
    kmeans.fit(data2)
    meandistortions.append(
        sum(np.min(cdist(data2,kmeans.cluster_centers_,'euclidean'), axis=1)
           ) / data2.shape[0]
    )

# Plot the Elbow Curve
plt.plot(K, meandistortions, 'bx--')
plt.xlabel('k')
plt.show()


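# Initialize KMeans with 4 clusters (observed from the Elbow plot)
# Note: without a fixed random_state, the cluster numbering can change between runs
k_means = KMeans(init='k-means++', n_clusters=4, n_init=10, max_iter=500)
label = k_means.fit_predict(data2)  # Fit once and return each country's cluster label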

data3 = mean_values_counrty2['Country name']
data4 = data3.unique()

# Check cluster results
data_type = pd.DataFrame(label)
data_type.columns = ['Clusters'] # Name the Column
whr_cluster = pd.merge(mean_values_counrty2, data_type, left_index=True, right_index=True)
# whr_cluster.head()

# Extract countries in cluster 0,1,2,3
cluster_0 = whr_cluster.loc[whr_cluster['Clusters'] == 0, 'Country name'].values
cluster_1 = whr_cluster.loc[whr_cluster['Clusters'] == 1, 'Country name'].values
cluster_2 = whr_cluster.loc[whr_cluster['Clusters'] == 2, 'Country name'].values
cluster_3 = whr_cluster.loc[whr_cluster['Clusters'] == 3, 'Country name'].values

print(f'Cluster 0: \n {cluster_0} \n')
print(f'Cluster 1: \n {cluster_1} \n')
print(f'Cluster 2: \n {cluster_2} \n')
print(f'Cluster 3: \n {cluster_3} \n')


# Boxplots in 3*3 format
fig, axes = plt.subplots(3, 3, figsize=(12, 12))

# List of features
features = ['Life Ladder', 'Log GDP per capita', 'Social support',
            'Healthy life expectancy at birth', 'Freedom to make life choices',
            'Positive affect', 'Generosity', 'Negative affect',
            'Perceptions of corruption']

# Plotting
for i, feature in enumerate(features):
    row = i // 3  # row index
    col = i % 3   # col index
    sns.boxplot(x='Clusters', y=feature, data=whr_cluster, ax=axes[row, col])
    axes[row, col].set_title(f'Distribution of {feature} by Cluster')

plt.tight_layout()
plt.show()